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Abstract 

Background: High-throughput sequencing allows the detection and quantification of frequencies of somatic single 
nucleotide variants (SNV) in heterogeneous tumor cell populations. In some cases, the evolutionary history and 
population frequency of the subclonal lineages of tumor cells present in the sample can be reconstructed from these 
SNV frequency measurements. But automated methods to do this reconstruction are not available and the conditions 
under which reconstruction is possible have not been described. 

Results: We describe the conditions under which the evolutionary history can be uniquely reconstructed from SNV 
frequencies from single or multiple samples from the tumor population and we introduce a new statistical model, 
PhyloSub, that infers the phylogeny and genotype of the major subclonal lineages represented in the population of 
cancer cells. It uses a Bayesian nonparametric prior over trees that groups SNVs into major subclonal lineages and 
automatically estimates the number of lineages and their ancestry. We sample from the joint posterior distribution 
over trees to identify evolutionary histories and cell population frequencies that have the highest probability of 
generating the observed SNV frequency data. When multiple phylogenies are consistent with a given set of SNV 
frequencies, PhyloSub represents the uncertainty in the tumor phylogeny using a "partial order plot". Experiments on 
a simulated dataset and two real datasets comprising tumor samples from acute myeloid leukemia and chronic 
lymphocytic leukemia patients demonstrate that PhyloSub can infer both linear (or chain) and branching lineages and 
its inferences are in good agreement with ground truth, where it is available. 

Conclusions: PhyloSub can be applied to frequencies of any "binary" somatic mutation, including SNVs as well as 
small insertions and deletions. The PhyloSub and partial order plot software is available from https://github.com/ 
morrislab/phylosub/. 



Background 

Cancer is a complex disease often associated with a char- 
acteristic series of somatic genetic variants [1,2]. Substan- 
tial effort has been devoted to genetic profiling of tumors 
in hopes of identifying these driver mutations and study- 
ing how they drive tumor development and resistance 
to treatment [3]. Tumors often contain multiple, genet- 
ically diverse subclonal populations of cells [4], and in 
some cases it is possible to reconstruct the evolutionary 
history of the tumor, thereby aiding in the identifica- 
tion of driver mutations, by computing the population 
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frequencies of mutations that distinguish the subclonal 
populations [5-13]. 

Somatic mutations can be detected, and roughly quanti- 
fied, using exome and whole genome sequencing of a sam- 
ple from a bulk tumor [14]. However, recent attempts to 
reconstruct subclonal phylogenies have employed much 
deeper targeted sequencing [15] of tumor-associated sin- 
gle nucleotide variants (SNVs) to achieve higher accuracy 
in estimated SNV frequencies [9,10,16,17]. These SNV 
frequencies were then used to partially reconstruct the 
evolutionary history of tumors based on a single [10,16] 
or multiple [9] samples of same tumor. However, due to 
short read sequencing, the frequencies of different SNVs 
are measured independently, so linkage between the SNVs 
in subclones is unavailable and standard phylogenetic 
methodology can not be used to construct evolution- 
ary histories (as done in [18] or [17]). However, if one 
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makes the infinite sites assumption about tumor evolu- 
tion, namely that every SNV only appeared once, then 
it is possible to use SNV frequencies to automatically 
reconstruct full or partial subclonal phylogenies while 
also inferring the multiple SNV genotypes of the major 
subclonal lineages in the tumor. 

Here we describe a new method that automatically per- 
forms this phylogenetic reconstruction. First, we demon- 
strate that an unambiguous reconstruction is possible by 
describing topological constraint rules that are sufficient 
conditions to infer whether a triplet of SNV frequencies 
is consistent with only a chain or a branching phylogeny. 
We then describe a new method, PhyloSub, that auto- 
matically infers tumor phylogenies from SNV allele fre- 
quencies measured in single or multiple tumor samples. 
PhyloSub is based on a generative probabilistic model, 
inference in which implicitly implements the two rules by 
inferring the hidden phylogenies that have high probabil- 
ities of generating the observed SNV frequencies. It uses 
Bayesian inference, based on Markov Chain Monte Carlo 
(MCMC) sampling, to infer a distribution over phyloge- 
nies that incorporates uncertainty due to multiple phy- 
logenies being consistent with the SNV frequencies and 
also noise in the measurement of the SNV frequencies. 
PhyloSub uses a Dirichlet process prior over phylogenies 
[19] to group SNVs into major subclonal lineages. 

Model assumptions 

We assume that the tumor evolution proceeds according 
to the clonal evolution theory, namely that all tumor cells 
are derived from ancestors that gain growth advantages 
over normal tissue and begins to expand [18]. Subsequent 
mutations can provide a further fitness or survival advan- 
tage to their subclonal lineage [20] which subsequently 
increases in frequency compared to cells containing only 
the SNVs in the parental lineage. A given tumor sample is 
a snapshot of this evolutionary process and may contain, 
at non-negligible frequency, cells from multiple major 
subclonal lineages, each containing a different assortment 
of these advantageous mutations. We make the infinite 
sites assumption [21,22], namely that each SNV appears 
only once and furthermore that once it appears, it does not 
revert back to its original state. As we describe below and 
illustrate in Figure 1, in some circumstances, this assump- 
tion highly constrains the phylogenies that are consistent 
with the SNV allele frequency data, especially if SNV fre- 
quencies from multiple samples from the same tumor 
are available. Finally, to make our model robust to low 
tumor cellularity, we assume that each tumor is derived 
from a single clone, however, this assumption is not crit- 
ical in modeling tumor evolution and we revisit this 
assumption in the discussion section where we describe 
how to generalize our model to multicentral tumors 
(e.g., [23]). 



To simplify our initial discussion, we will assume that 
the exact population frequencies of the cells containing 
each SNV (i.e., the SNV population frequency) are avail- 
able before discussing how we estimate these frequencies 
from deep sequencing data of the SNV locus. Note, we 
assume that the copy number of a locus is available as 
per [10]. In the datasets that we considered, most SNVs 
are heterozygous at a normal copy number locus and 
the population frequency of other SNVs is easily inferred 
from their allele frequencies. In more complex situations, 
a number of tools are available to infer copy number 
changes associated with specific subclonal lineages from 
whole genome sequencing data [11,13]. 

An important consequence of the infinite sites assump- 
tion is that if SNV B occurred in a cell that contained 
SNV A, then all cells that have B also have A and thus the 
population frequency of A must always be greater than or 
equal to that of B, regardless of where and when the tumor 
sample was taken. However, a given set of three SNV 
population frequencies can still be consistent with two 
different phylogenies: a linear phylogeny or a branching 
phylogeny (see Figure lA). 

Topological constraint rules 

One can distinguish linear or branching descent under 
some circumstances. For example, if we have already 
established that SNV A is ancestral to both B and C (i.e., 
that all cells with B or C also contain A), then if the popu- 
lation frequency of B plus the population frequency of C is 
greater than the population frequency of A, then the phy- 
logeny must be linear. This is true because in a branching 
phylogeny, there are no cells that contain both B and C, 
so the population frequency of A must be at least as large 
as sum of the frequencies of B and C (see Figure IB). We 
call this the "sum rule". However, because a linear phy- 
logeny is consistent with any set of SNV frequencies from 
a single sample, without making any further assumptions 
about the tumor evolution process, one needs at least two 
tumor samples to be able to rule out a linear phylogeny. 
However, given two samples and again assuming that SNV 
A is ancestral to both B and C, if the population frequency 
of B is larger than that of C in one sample, and vice versa 
in the other, than neither B nor C can be ancestral to the 
other, and the only phylogeny consistent with both sets 
of SNV frequencies is the branching one. We call this the 
"crossing rule" because the frequencies of B and C cross 
(see Figure IC for an example). However, there is no guar- 
antee that one can apply either rule to any set of SNV 
frequencies for all triplets of SNVs, although increasing 
the number of tumor samples does make it more likely 
that either the sum or crossing rule will be applicable for 
one or a pair of tumor samples, respectively. Furthermore, 
one needs to also consider the possibility of estimation 
error in the SNV population frequencies because these are 
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Figure 1 Visualization of topological constraint rules. (A) A, B, C are three SNVs, each of which represents a set of SNVs with similar SNV 
population frequencies. When the SNV population frequencies are 0.8, 0.4, 0.2 (left panel), there might be two possible phylogenies that are 
consistent with these frequencies (middle panel). The two solutions estimate same number of clonal populations but different genotypes for each 
clone. The decomposition of the clonal population frequencies are shown on the right panel. (B) Because of the sum rule, for this given set of SNV 
population frequencies 0.8, 0.6, 0.4, a chain structure may be the only possible phylogeny to explain the frequency changes. (C) Under the crossing 
rule, when multiple samples from the same patient are taken, we would expect the phylogenies are shared between samples. When another set of 
frequencies are observed 0.8, 0.2, 0.4, the branching structure is the only possible phylogeny to explain the frequencies changes for both sample 1 
and 2. 



inferred from discrete read counts. Note that these two 
rules also apply where SNV A is a mock SNV represent- 
ing the wildtype state and having population frequency of 
100%; as such these two rules also apply for multicentral 
tumors. 

The PhyloSub algorithm 

To explicitly model uncertainty in estimates of the SNV 
population frequencies and the precise tumor phylogeny, 
we have developed the PhyloSub model that we describe 
here. PhyloSub attempts to explain the observed read 
counts in terms of a latent phylogeny that associates SNVs 
with particular subclonal lineages. We provide software 
that takes as input a set of read counts for a set of 
SNVs and the copy number status of each SNV, performs 
inference in the PhlyoSub model to estimate the num- 
ber of major subclonal lineages, the mutational profile of 
each lineage, and the proportion of each lineage within 
the tumor cell population from which the read data was 
drawn. PhyloSub implements the parsimony assumptions 
detailed above using a non-parametric prior over tree 
structures. It is "generative" in that it attempts to explain 
the observed SNV frequencies in terms of an unobserved 
phylogeny; our model is also "Bayesian" in that it infers 
a posterior distribution over phylogenies and associated 



subclonal lineage frequencies. We introduce a new visu- 
alization, the partial order plot, to represent the posterior 
uncertainty in the phylogeny when the SNV frequencies 
alone do not provide sufficient information to uniquely 
reconstruct the phylogeny (Figure 2). The sum and cross- 
ing rule described above are implicitly incorporated into 
our generative model - our model assigns very low proba- 
bility to any read counts that reflect deviations from either 
rule. 

In the following, we provide a brief introduction to 
the PhyloSub model (see Section "Methods" for the full 
model) and we demonstrate its application to datasets 
where a single sample is profiled [17] and those where 
multiple samples are profiled [9] . We also report the appli- 
cation of the model on a simulated dataset to show that 
its prior parameterization allows it to represent a wide 
variety of phylogenies. 

Results and discussion 

PhyloSub 

PhyloSub represents the major subclonal lineages and 
their evolutionary relationships using a directed tree in 
which each node is associated with a subclonal lineage 
and the edges connect parental lineages to their direct 
child lineage. Each subclonal lineage is associated with a 
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Figure 2 Derivation of partial order plot. (Left column) Posterior 
distribution over trees, eacli tree lias a 0.5 probability under the 
posterior. (Right column) Derived partial order plot. Nodes correspond 
to SNVs. Edge thickness is proportional to the posterior probability 
that parent SNVs is in a subclonal lineage that is the parent of the one 
containing the child SNV. SNV nodes are ordered based on a 
topological sort implemented as the "layered graph drawing method" 
in Graphviz [24]. In this example, SNV A is always in a subclonal lineage 
that is the parent of B but the same is true with probability 0.5 for C. 



distinct subset of the SNVs input to the model, we call this 
subset the genotype of the lineage. Each node is also asso- 
ciated with (i) a set of SNVs that are present in this lineage 
but not its parent lineage and (ii) the population frequency 
of cells with the lineage genotype (and with no other SNVs 
from the input set). A subclonal lineage contains all of the 
SNVs associated with its parent, so its full genotype can 
be reconstructed by taking the union of the SNVs associ- 
ated with its node and all of its ancestral nodes. Similarly, 
the population frequency of an SNV is the sum of the 
subclonal lineage frequencies of the lineage it appeared in 
and all of its descendent lineages. So, the subclonal lineage 
tree can be used to compute the population frequencies 
of each SNV and the genotype of each subclonal lineage. 
Associated with each SNV is a variable that indicates its 
zygosity and copy number in the cells that it appears (e.g., 
Aa indicating heterozygous and normal copy number), we 
assume all cells with the SNV have the same zygosity and 
copy number, and that all other cells have normal copy 
number at the SNV locus. The SNV genotype variable 
along with the population frequency is used to compute 
the allelic frequency of the SNV /, pi. The data input to 
the model for each SNV is the number of reads mapping 
to the SNV locus, di, and the number of these reads that 
do not contain the SNV, Ui, We evaluate the likelihood of 
a given subclonal lineage tree (including the lineage pop- 
ulation frequencies and the SNV genotype variables) by 
taking the product of the read count probabilities for each 



SNV, where the probability for the locus of SNV / is com- 
puted using a binomial distribution whose parameter is 
derived from pi and an estimate of the error rate of the 
sequencer. PhyloSub also contains a vague prior over tree 
structures that is parameterized by three hyperparameters 
(ofo, K,A.) (see Section "Methods") that govern how the 
prior scores trees with more or fewer nodes, and differ- 
ent average numbers of siblings. We use ranges for these 
hyperparameters that in simulations have a slight prefer- 
ence for trees with fewer nodes but a limited preference 
for sibling numbers (see below for details). 

Simulations 

PhyloSubs Dirichlet process prior over tree structures 
depends on three hyperparameters: ao, y, and A. The 
hyperparameters ao and X determine the number of nodes 
(subclones) in the tree, k also affects the height of the tree 
and y affects the number of siblings in the tree which 
in turn affects the width of the tree. In all the experi- 
ments, we sample these hyperparameters [19] as part of 
the MCMC sampling from a range whose upper and lower 
bounds we establish in this section. 

To establish the ranges that we use for the hyperpa- 
rameters in PhyloSub, we simulated read counts from 
clusters with an average of nine SNVs per cluster with 
SNV population frequencies {1.0, 0.85, 0.6, 0.35, 0.2, 0.08}, 
with a read depth of ^ 10, 000 x which is a typical 
read depth for the targeting deep sequencing data that 
PhyloSub is designed for. We simulated heterozygous 
SNVs at loci with normal copy number and sample 
read counts for each SNV from a Binomial distribution 
(see Section "Methods"). The hyperparameter settings 
we used in the simulations are all possible combinations 
of ao G {1,2,4,10,20,50}, 7 G {1, 2, 4, 6, 8} and A. G 
{0.25,0.5,1}. The SNV population frequencies are con- 
sistent with many different tree structures and Figure 3 
shows that the tree structures with highest complete-data 
likelihoods varies in the expected way for different set- 
tings of the tree prior hyperparameters. Although the 
preferred structure varies, the inferred SNV frequencies 
remain well-correlated with the baseline values (Pearson 
correlation > 0.99) for these hyperparameter ranges, so 
the prior is not over-regularizing the SNV frequencies 
for these settings. To allow a range of tree structures, we 
integrate over these ranges by placing a uniform prior on 
the choice of these settings in our MCMC simulations 
(c.f., [19]). 

Although we focused on high read depths in the above 
simulation, we found that PhyloSub works well for read 
depths ^ 1, OOOX and was able to recover the clusters sim- 
ilar to the ones reported above and the SNV frequencies 
are well-correlated with the baseline values (Pearson cor- 
relation > 0.99). However, we found that the performance 
of the model degrades slightly at a read depth of ^ 200X, 
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ao = 20,7 = 1, A = 0.5 ao = 50, 7 = 1, A = 0.25 ao = 50, 7 = 1, A = 0.5 

Figure 3 Results on simulated dataset. Best tree structures, i.e., the ones with the highest complete-data lil<elihood, estimated by PhyloSub with 
varying hyperparameters for the simulated dataset. We only show a subset of the trees from 90 MCMC runs corresponding to all possible 
combinations of the hyperparameters used in the simulation. 



due to merging of clusters whose nearby SNV frequen- 
cies could not be distinguished. Nonetheless, we note that 
the inferred SNV population frequency estimates remain 
well-correlated (Pearson correlation > 0.96) and that the 
majority of the clusters were recovered at read depth ^ 
200 X. 

The simulation as described above has no clear phy- 
logeny by design. The SNV frequencies were consistent 
with multiple phylogenies and the main goal of this simu- 
lation was to establish the ranges for our hyperparameters 
that permit a wide variety of tree structures. We integrate 
over these parameter ranges on the real data in order to 
remove any prior bias towards particular structures. To 
determine whether PhyloSub can correctly recover the 
phylogenies from a single sample of SNV frequencies, we 
simulated read data from a chain phylogeny with SNV 
population frequencies 0.9 0.75 0.55 0.4 
0.25. By the sum rule, these frequencies are only consis- 
tent with a chain phylogeny. PhyloSub was able to reliably 
recover this chain. The real datasets described in the 
later sections are representative of the types of problems 
that our methodology could be applied to as they con- 
tain single and multiple samples, some of which have clear 
phylogenies and some do not. 

Results on AML datasets 

To assess PhyloSub on single samples, we applied it to data 
from Jan et al. [17] who reported the coexistence of mul- 
tiple subclonal lineages in hematopoietic stem cells (HSC) 
from acute myeloid leukemia (AML) patient samples. The 
deep-targeted sequencing of all SNV candidates identified 
by exome sequencing identified SNVs with differing allelic 
frequency, suggesting multiple clonal populations in the 
HSC cells. An independent single-cell assay confirmed the 
existence of multiple clones, and thus provides a ground 
truth tree that shows some of the major subclonal lin- 
eages within the populations. Here we apply PhyloSub 
to the two samples profiled by Jan et al. that had three 



or more SNVs profiled in a single-cell assay. These sam- 
ples are SU048 and SU070 which have 6 and 10 SNVs in 
the single-cell assay, respectively. Although this assay con- 
firmed the presence of some of the subclonal lineages, 
only 100-200 cells were assayed, so lineages with low pop- 
ulation frequency in the sample (e.g., < 1%) may not be 
detected. 

We applied PhyloSub providing it with the copy num- 
ber and zygosity of each SNV (results were similar if we 
assume normal copy number and have a uniform prior 
on zygosity). For both SU048 and SU070, a number of 
different phylogenies were consistent with the SNV read 
counts, and we developed the "partial order plot" to rep- 
resent the posterior uncertainty in the phylogeny (see 
Figure 2 and Section "Methods"). 

Figure 4 shows that partial order plot for SU070. The 
ordering of the nodes in the partial order plot can also 
be used to infer ancestry via transitivity, for example, in 
Figure 4, the SNV CXorf66 has high probability of being 
in the subclonal lineage that is the direct parent of the one 
that DOCK9 is in, however, because the TET2-T1884A 
SNV is sorted before CXorf66 (and has a small probability 
of being a direct parent of it), then in the PhyloSub poste- 
rior over lineages, TET2-T1884A has a high probability of 
being in an ancestral lineage to the one CXorf66 is in. 

Furthermore, one can interpret the partial order plot 
to indicate that both CXorf36 and CXorf66 are in the 
same lineage because they are both direct parents of 
DOCK9 (with high probability) and there are no edges 
between them. For reference, in Figure 4 we have included 
the results of the single-cell assay for SU070 in the par- 
tial order plot representation - Jan et al. report three 
subclonal lineages for SU070, as indicated by the SNV 
colorings [17]. We note that these plots are largely consis- 
tent. Indeed, we assign high posterior probability > 0.96, 
to two of the three subclonal lineages detected by Jan 
et al. (see Additional file 1: Table S3 for full lineage geno- 
type probabilities). For reference, we also provide the list 
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Figure 4 Clonal evolutionary structures of tumor sample SU070. (A) Ground truth from Jan etol. [17]. (B) PhyloSub's output summarized using 
a partial order plot. For clarity, we removed edges with probability less than 0.1 before laying out the nodes and we only show SNVs for which 
single-cell data is available. However, all the SNVs whose frequencies were reported in this study were used in the inference and (Additional file 2: 
Figure SI) shows the full partial order plot. The color of the border of the SNVs represents the subclonal lineage cluster that the SNV is placed into by a 
graph-based clustering algorithm that takes as input the co-clustering frequencies from the MCMC samples (see Section "Methods"). Note that unlike 
the thickness of the edges, this is simply a visualization aid, and does not fully represent the model's posterior uncertainty in the SNV clusterings. 



of the subclonal lineage trees along with their posterior 
probabilities in see (Additional file 1: Table SI). 

The one major difference between PhyloSubs estimates 
and the single-cell data from Jan et al. is that PhyloSub 
switches the order of the appearance of SNVs CXorf36 
and TET2-T1884A. In fact, there was not a single sub- 
clonal lineage that contained CXorf36 but not TET2- 
T1884A in 5,000 subclonal lineage trees sampled from 
PhyloSubs posterior. This switch is likely due to the 
observed SNV frequencies, indeed the 95% confidence 
intervals of the SNV frequencies of these two SNVs do 
not overlap (Table 1). One explanation for this difference 
is a systematic bias in the measurement of one or both of 
these SNVs; it is also possible that the labels of these two 
SNVs were switched in Jan et al. We also note, however, 
that in Jan et al, the existence of the lineage that contains 
only CXorf36, TET2-Y1649stop, and CACNAIH is only 
supported by 2 of the 189 clones that they profiled. 

For the tumor sample SU048, both the partial order plot 
and the single-cell assay agree on TET2-E1357stop event 
occurring early (at the root of the tree), and all other SNVs 
are secondary mutational events as shown in Figure 5B. 
Note that the partial order plot shows a large uncertainty 
in the structure for the rest of the SNVs and this is also 
reflected in the posterior over subclonal lineage trees and 
genotypes (see Additional file 1: Tables S2 and S4, respec- 
tively). There is no strong evidence for either a linear or 



branching lineage or for particular clustering among these 
SNVs. Also, from Table 2, we see a lot of variation in the 
allele frequencies of these SNVs suggesting that they may 
not belong to the same subclonal lineages. The subclonal 
lineage inferred by Jan et al/s single-cell assay is shown 
in Figure 5A and only contains two lineages, one with 
only TET2-E1357stop and the other with the other five 
SNVs. The TET2-E1357stop lineage genotype has proba- 
bility 0.81 in our posterior, however the second genotype 
has a relatively small probability (0.06) under the poste- 
rior although we note that the genotype that contains all 
SNVs but ZMYM3 has a posterior probability of 0.32 (see 
Additional file 1: Table S4). For reference, we also pro- 
vide the list of the subclonal lineage trees along with their 
posterior probabilities in (Additional file 1: Table S2 ). 

In summary, the subclonal lineage trees inferred by 
PhyloSub on single samples of SNV frequencies are largely 
consistent with ground truth but there remains substan- 
tial uncertainty in SU048 about whether there was a linear 
or chain lineage. On the other hand, the SNV frequen- 
cies in SU070 were only consistent with a linear lineage 
and PhyloSub almost perfectly reconstructed the results 
of the single-cell assay with one misordering of the SNVs. 
This difference may be explained by unmodeled system- 
atic biases in the deep sequencing data or experimental 
error. Nonetheless, we have shown that in some cases, 
it is possible to achieve a good estimate of the genotype 
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Table 1 Allelic counts for tumor sample SU070 from Jan et at. [1 7] 


SNV 


Variant allele read counts 


Read depth 


Allele frequency 


Cluster ID 


CACNAIH 


12,085 


24,860 


0.486 (95% CI: 0.481 -0.491) 


A 


TET2-T1 884A 


4,220 


8,772 


0.481 (95% CI: 0.472-0.490) 


B 


TET2-Y1649stop 


7,792 


16,211 


0.481 (95% CI: 0.474-0.487) 


A 


CXorf66 


3,684 


8,150 


0.452 (95% CI: 0.443-0.461) 


B 


CXorf36 


3,523 


8,060 


0.437 (95% CI: 0.428-0.446) 


A 


D0CK9 


3,391 


8,676 


0.391 (95% CI: 0.382-0.400) 


C 


NCRNA00200 


9,201 


25,413 


0.362 (95% CI: 0.357-0.367) 


C 


CTCF 


10,558 


30,119 


0.351 (95% CI: 0.346-0.355) 


C 


GABARAPLl 


1,648 


4,992 


0.330 (95% CI: 0.3 19-0341) 


C 


SCN4B 


5,113 


16,386 


0.31 2 (95% CI: 0.306-0.31 8) 


C 



of multiple subclonal lineages as well as their evolution 
from a single, targeted deep sequencing sample of SNV 
frequencies. 

Results on CLL datasets 

To evaluate PhyloSub on a multiple sample dataset, we 
used data from a study of chronic lymphocytic leukemia 
(CLL) by Schuh et al. [9] which quantified SNV frequen- 
cies of a set of SNVs during different time points spanning 
the patient therapy cycle. The candidate SNVs were iden- 
tified by exome sequencing and then subjected to targeted 
resequencing. The tumor samples from the three patients 
in the study labeled CLL077, CLL006 and CLL003 have 
11, 16 and 20 SNVs respectively with SNV frequencies 
for five different time points. Originally, Schuh et al 
reconstructed the evolutionary histories of each cancer 
by a semi-manual procedure in which they first automati- 
cally grouped SNVs into subclonal lineages using /c-means 



clustering on the allele frequencies and the differences 
in allele frequencies between the time points for each 
patient and then reconstructed the evolutionary struc- 
ture of those lineages using a procedure that they do not 
describe in the paper. In PhyloSub, we model multiple 
samples from the same cancer as sharing the same evo- 
lutionary history but we allow subclonal frequencies to 
change between samples. 

We applied PhyloSub to the SNV read count data, pro- 
viding the algorithm with the likely zygosity estimates - 
in most cases, SNVs appeared to be heterozygous with 
normal copy number but in a few cases, SNVs appeared 
to be hemizygous and were input to the model as such. 
For these data, because of the multiple samples per tumor, 
there is very little posterior uncertainty in the best fitting 
tree - as such, we only show the best single tree struc- 
ture corresponding to the MCMC sample with the highest 
complete-data likelihood [19]. 




Figure 5 Clonal evolutionary structures of tumor sample SU048. (A) Ground truth from Jan etol. [1 7]. (B) PhyloSub's output summarized using 
a partial order plot. See Figure 4 legend for more details on partial order plot. 
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Table 2 Allelic counts for tumor sample SU048 from Jan et at. [1 7] 


SNV 


Variant allele read counts 


Read depth 


Allele frequency 


Cluster ID 


TET2-E1357stop 


7,436 


19,553 


0.380 (95% CI: 0.375-0.386) 


A 


SMCIA 


1 82,974 


660,069 


0.277 (95% CI: 0.276-0.278) 


B 


ACSMl 


17,149 


127,236 


0.135 (95% CI: 0.1 33-0.1 36) 


B 


0LFM2 


13,828 


122,523 


0.1 13 (95% CI: 0.1 11 -0.1 14) 


B 


TET2-D 1384V 


1,833 


1 7,687 


0.104 (95% CI: 0.1 00-0.1 07) 


B 


ZMYM3 


18,536 


307,346 


0.060 (95% CI: 0.060-0.061) 


B 



For the tumor samples CLL077 and CLL003, the best 
tree structure estimated by PhyloSub and the tree struc- 
ture from Schuh et al. [9] are in exact agreement and 
the population frequencies of the subclonal lineages are 
well-correlated. Figures 6 and 7 compare the PhyloSub 
estimates with those reported by Schuh et al, [9] . 

For the tumor sample CLL006, PhyloSub inferred a 
chain structure similar to the chain structure from Schuh 
et al, but the major difference in PhyloSub s best esti- 
mate of the tree structure is the splitting of cluster A 
into two clusters as shown in Figure 8. However, we 
found that the complete-data log likelihood of PhyloSub s 
best estimate of the tree structure is higher than the 
one for the chain structure of Schuh et al. and therefore 
PhyloSub prefers the splitting of the cluster A into two 
clusters. 

In the CLL dataset, there is no ground truth but to allow 
the reader to compare the two estimates of the evolution- 
ary history, Figure 9 plots the frequency of each SNV in 
the three samples, and we have colored SNVs according to 



their subclonal lineage assignments by Schuh et al These 
SNV frequencies are not corrected for copy number, how- 
ever, the hemizygous SNVs are clear from examination of 
the figure. 

In summary, having multiple samples of SNV fre- 
quencies greatly reduces the posterior uncertainty in 
the evolutionary history of the tumor and PhyloSub is 
able to reconstruct histories produced by a semi-manual 
procedure. 

Conclusions 

We presented a nonparametric Bayesian model called 
PhyloSub that uses a Dirichlet process prior over trees [19] 
to model the clonal evolutionary structure of tumors from 
next generation sequencing data. We also introduced a 
new visualization method, the partial order plot, to rep- 
resent the posterior uncertainty in the phylogeny when 
the clonal frequencies alone do not provide sufficient 
information to uniquely reconstruct the phylogeny and 
mutational profiles of each subclonal lineage represented 
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SAMHD1;SLC12A1 




SAMHDl: SLC12A1 






C: DAZAPl; EX0C6B; GHDC; 


c: DAZAPl; EX0C6B; GHDC; 




0CA2; PLA2G16 




0CA2; PLA2G16 






D: LRRC16A 




d: LRRC16A 






E: KLHDC2; COL24A1; NODI; 


e: KLHDC2; COL24A1; NODI; 




HMCNl; MAP2K1 




HMCNl; MAP2K1 






Figure 6 Clonal evolutionary structures of tumor samples from patient CLL077. (Left) Baseline tree structure from Schuh etol. [9] (Right) 


Best tree structure estimated by PhyloSub. The SNV population frequencies and the cluster assignments are also shown in the figure. 
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A: - 

B: ADADl; CHTF8; HERC2; ILllRA; 

SF3B1; SHROOMl 
C: ASXLl; MUSK; SEMA3E 
D: NPY; NRG3; FAT3; CHRNB2 
E: AMTN; APBB2; ATM; BPIL2; 

PLEKHG5; MTUSl; SPTANl 



b: ADADl; CHTF8; HERC2; ILllRA; 

SF3B1; SHROOMl 
c: ASXLl; MUSK; SEMA3E 
d: NPY; NRG3; FATS; CHRNB2 
e: AMTN; APBB2; ATM; BPIL2; 

PLEKHG5; MTUSl; SPTANl 



Figure 7 Clonal evolutionary structures of tumor samples from patient CLL003. (Left) Baseline tree structure from Schuh etal. [9]. (Right) 
Best tree structure estimated by PliyloSub. Tine SNV population frequencies and the cluster assignments are also shown in the figure. 



in the tumor. By enforcing a set of structural constraints 
on the SNV population frequencies using MCMC meth- 
ods, we were able to infer the phylogenetic relationships 
between subclones from both single and multiple tumor 
samples. 

We have demonstrated that it is possible, in some cases, 
to detect a linear lineage from a single, high cellularity 
sample of the tumor. We have also shown that multi- 
ple samples highly constrain the possible lineages that 
are consistent with the SNV frequency data. PhyloSub s 
inferred subclonal lineage trees were in good agreement 
with single cell assays on single sample data and with an 
expert-driven, semi-manual reconstruction procedure on 
multiple sample data. 

PhyloSub s ability to detect and characterize subclonal 
lineages depends on the frequency of the lineage in the 
population (compared to its descendant lineages), the num- 
ber of SNVs that define the lineage, as well as the accu- 
racy with which the SNV population frequencies are 



estimated which depends on both the sequencing depth 
as well as uncertainty about the copy number of the SNV. 
Simply put, for lineages defined by a single SNV, the 
read depth has to be high enough that the uncertainty in 
the estimated SNV frequency is less than the frequency 
of the subclonal population. Having more lineage- 
defining SNVs can relax this hard constraint. As such, 
the phylogenies of tumors with large numbers of sub- 
clonal lineages, each defined by a small number of SNVs 
(possibly due to a pronounced hypermutability pheno- 
type), will be hard to reconstruct with PhyloSub, or any 
other method, unless the SNV frequencies are very accu- 
rately estimated. Indeed, it is not clear how ground truth 
could be uncovered in such a case: the gold standard 
of single cell sequencing would require an exceptionally 
large number of single cells to survey this highly hetero- 
geneous population, and each of these cells would need 
to be sequenced deeply in order to ensure precise somatic 
variant calling. 
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A: ARHGAP29- EGFR- KIAA0182; KLHL; 

MED 12; PIL^IB; SIKI 
B: U2AF1 
C: KIAA0319L 
D: IRF4 
E: RBPJ 



a: ARHGAP29; KIAA0182; KLHL; 

MED12; PILllB; SIKl 
b: EGFR 
c: U2AF1 
d: KIAA0319L 
e: IRF4 
f: RBPJ 



Figure 8 Clonal evolutionary structures of tumor samples from patient CLL006. (Left) Baseline tree structure from Schuh etal. [9]. (Right) 
Best tree structure estimated by PhyloSub. The SNV population frequencies and the cluster assignments are also shown in the figure. 
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One potential difficulty in scaling our approach to 
orders of magnitude more SNVs is that the Markov chain 
may not mix in a timely manner, in other words, may 
get stuck in local minima. We note that finding subopti- 
mal solutions is an issue for any method based on these 
data. In our case, the mixing time of the chain would 
depend largely on the number of subclones represented 
in the population with less of a dependence on the num- 
ber of SNVs. There are various techniques for determining 
whether or not a Markov chain is well-mixed and we refer 
the reader to a recent excellent review [25]. 

PhyloSub extends recent work on inferred cellularity and 
subclonal structure from somatic mutations. ABSOLUTE 
uses whole genome sequence data or array CGH data to 
identify regions of copy number change in the tumor and 
based on this infers cellularity and copy number changes 
associated with different subclones [11]. THetA [13] also 
attempts to infer both the copy number profiles and their 
relative proportions using the whole genome sequencing 
data based on an infinite sites assumptions. Neither of 
these algorithms explicitly reconstructs tumor phyloge- 
nies. Our work is closest to PyClone [10] which uses a 
flat Dirichlet process mixture model to group SNVs into 
subclonal lineages based on their frequencies; PhyloSub 
extends this work by reconstructing the phylogenetic rela- 
tionships among these lineages and, in doing so, allows 
the full SNV genotype of each subclonal population to be 
reconstructed. 

We designed PhyloSub to assume a single clonal origin 
for the cancerous cells in the sample. We made this deci- 
sion to increase the applicability of the sum rule for low 
cellularity tumors (i.e., tumors with high normal contam- 
ination). However, removing this assumption would be a 
simple change to the model, which we have not evaluated. 



Another area of future innovation would be in mod- 
eling sequencing biases and uncertainty in SNV allele 
frequencies resulting from them. We did not evaluate 
replacing our binomial model with a negative binomial 
one that would have allowed greater variability in the 
observed read counts for a given SNV allele frequency 
[26]. 

Methods 

Dirichlet process mixture models 

Consider the problem of clustering A/" objects {xi}^-^ using 
a Bayesian finite mixture model of K components (clus- 
ters) with the following generative process [27]: 

(o ~ Dirichlet (a //<C, . . . ,a/K); Zi ^ Multinomial (w); 
(l)k ~ H; Xi ~ 

(1) 

where co are the mixing weights such that o)k = hot 

is the concentration parameter of the symmetric Dirich- 
let prior placed on the mixing weights, Zi G {1, . . . ,K} 
is the cluster assignment variable, H is the prior distri- 
bution from which the component parameters {0/^} are 
drawn, is the component distribution parameter- 

ized by 0. The finite mixture model can be extended to a 
model with an infinite number of mixture components by 
replacing the Dirichlet prior with a Dirichlet process (DP) 
prior resulting in what is known as the DP mixture model 
(DPMM) [28]. 

Unlike finite mixture models, DPMMs automatically 
estimate the number of components from the data thereby 
circumventing the problem of fixing the number of com- 
ponents a priori. The stick-breaking construction [29] 
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given below provides a precise recipe to draw samples 
from a Dirichlet process: 

k-i 

^^-Beta(l,a); c^i = fii; c^k = PkY\a - 

1=1 

oo 
k=i 

(2) 

where 5^ is a point mass centered at 0 and Q ~ DV(a,H), 
i.e., ^ is a draw from a DP with base distribution H and 
concentration parameter a. The stick-breaking process 
can be viewed as recursively breaking sticks of length 
nJliCl — starting with a stick of unit length. The 
beta variates {P^} determine the random location at which 
the stick is broken. The concentration parameter a deter- 
mines the number of clusters with high values resulting 
in large number of clusters. Let GEM(Qf) denote the stick- 
breaking process over oo. Replacing the Dirichlet prior in 
the finite mixture model 1 with the stick-breaking pro- 
cess prior results in the following generative process for 
infinite mixture models: 

(o ~ GEM(Qf); Zi ~ Multinomial (w); 
(j)k H; Xi F(0z,.). 

An alternative view of the above generative process pro- 
duces component parameters {0/} by drawing samples 
from Q resulting in the following generative process: 

67-DP(a,//); 0,-67; Xi ^ F(4>i). (3) 

Note that in the above process every object {xi}^-^ is 
associated with a component parameter {4>i}^i and that 
all objects assigned to the same cluster will have the same 
component parameter. In other words, multiple elements 
in the set {4>i}^i will take on the same value from the set 
{(l>k}f=i of unique parameters. 

Tree-structured stick-breaking process 

The stick-breaking construction (2) described above can 
be used to produce a flat clustering of objects, where the 
clusters are independent of each other. Adams et al [19] 
extended this construction for hierarchical clustering by 
interleaving two stick-breaking processes. This construc- 
tion results in a relational clustering of objects where 
the clusters are connected to form a rooted tree struc- 
ture. Unlike classical hierarchical clustering algorithms 
such as agglomerative clustering, this construction allows 
data to reside in the internal nodes of the tree; a fea- 



ture we exploit to model the association of SNVs with 
subclonal lineages. 

We borrow notation from Adams et al. [19]. Let e = 
(6i, . . . , ep) denote a sequence of positive integers used to 
index the nodes of the tree. Let e = k denote the zero- 
length string, i.e., the root of the tree. Let |€| indicate the 
length of the sequence € and therefore the depth of node 
€. Let €€i denote the sequence formed by appending 6/ to 
€. The children of node € is the set {€6^ : 6/ G 1, 2, . . .} and 
let the ancestors of € be denoted by the set {e^ : €^ < e}. 
The interleaved, two-layered stick-breaking construction 
is as follows: 

~ Beta(l,Qf(|€|)); ij/e ^ Beta(l, y); co^ = V/c; 

ei-l 

0)e = Ve(Pe ]"[ (Pe^il - V^/); (pe^i = feet ]"[ ~ ^^P* 
e'<e j=l 

(4) 

The and (1 — v^) determine the amount of mass allo- 
cated to € and its descendants respectively, whereas {cpe} 
determines the probability of a particular sequence of chil- 
dren. The construction ensures that the mixing weights 
{ooe} sum to one. The parameters a and y control the 
height and the width of the tree respectively. Note that the 
concentration parameter a( ) is a function of the depth of 
the tree («(•) : N R+) and is defined to be a{j) = X^ao 
with ao > 0 and A G (0, 1] [19]. 

PhyloSub model 

We follow Shah et al. [10,30] to model the allelic count 
data. For each genetic variant that is detected by high- 
throughput sequencing methods, cells containing the 
genetic variant are referred to as variant population and 
those without the variant as reference population. Let 
E = {A, C, G, T} denote the set of nucleotides. Let ai 
and bi denote the number of reads matching the ref- 
erence allele A g E and the variant allele B g E 
respectively at position /, and let di = ai + bi. The 
genotype g g {A, B, AA, AB, BB, AAA, . . .} would depend 
on the copy number at the variant location. Let /x[ G 
{fif, l^f^i IJif^^i . . .} denote the probability of sampling a 
reference allele from the reference population. This value 
depends on the error rate of the sequencer. Let ^i] denote 
a vector whose entries, /xj'^ g {/x?, /x^^, /x?^, . . .}, are 
the probabilities of sampling a reference allele from the 
variant population with genotype g at position i. Let tt/ 
denote the vector whose entries, nf e {tt?, ttA^, tt?^, . . .}, 
are the probabilities of the variant population at posi- 
tion / to have the genotype g. Let 5/ denote the pseudo- 
count parameters of the Dirichlet prior over tt/. Let G/ 
denote the genotype of the variant population at posi- 
tion /. Let 4>i denote the fraction of cells from the variant 
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population, i.e., the SNV population frequency at posi- 
tion and 1 — 0/ denote the fraction of cells from 
the reference population at position The observation 
model for allelic counts has the following generative 
process [10]: 

e~DP(a,//); ^i^G; 
jti ~ Dirichlet(5/); G/ ~ Categorical (tt/); 
ai I di, Gi =g, 4)i, fi^l ~ Binomial((i/, (1 - + 

(5) 

The posterior distribution of 0/ is 

g 

xp{Gi I 7ri)p(jri I 8i)p{^i), 

Each of the terms appearing inside the summation over 
genotypes is the probability distribution of a Dirichlet 
compound multinomial (with a single draw) [31]. The 
posterior distribution can thus be rewritten as 



p(4)i I •) oc ^ 



r(E^.5f + 1) 



X Binomialfe; (i/, (1 - + (l)iii{^)p{^i), 

(6) 

where r(-) is the Gamma function. 

The Dirichlet process prior DP(Qf, H) in the obser- 
vation model of allelic counts (5) is useful to infer 
groups of mutations that occur at the same SNV popula- 
tion frequency [10]. Furthermore, being a nonparametric 
prior, it is useful to avoid the problem of selecting the 
number of groups of mutations a priori. However, it 
cannot be used to model the clonal evolutionary struc- 
ture which takes the form of a rooted tree. In order 
to model this, we propose to use the tree-structured 
stick-breaking process prior (4) described in the previous 
section. 

The probabilistic graphical model for allelic counts with 
the tree-structured stick-breaking process prior is shown 
in Figure 10. Inputs to the model including the hyperpa- 
rameters are indicated in shaded nodes, whereas the latent 
variables including the set of SNV population frequen- 
cies {0/} are indicated in unshaded nodes. The prior/base 
distribution H of the SNV population frequencies is the 
uniform distribution Uniform(0, 1) for the root node and 

Uniform(0, 0par(v) - Y.weS{v) ^h/) ^^^^^ ^ 

in the tree, where par(v) denotes the parent node of v 
and S{v) is the set of siblings of v. This ensures that 




Figure 10 PhyloSub graphical model for single sample. 

Probabilistic grapliical model for allelic counts with tree-structured 
stick-breaking process prior. Observed variables and hyperparameters 
(inputs to the model) are indicated in shaded nodes. 



the clonal evolutionary constraints (discussed in the next 
section) are satisfied when adding a new node in the 
tree. The crucial difference between our model and the 
model of Shah et al [10] is that we use a tree-structured 
stick-breaking process instead of a Dirichlet process (cf. 
5) to generate the set of SNV population frequencies 
{0/}. Given this model and a set oiN observations/inputs 
{{uiy di, /xV, 5/)}^^, the tree structure and the SNV pop- 
ulation frequencies {0/} are inferred using Markov chain 
Monte Carlo sampling. In particular, we use Gibbs sam- 
pling [19] to generate posterior samples of the SNV pop- 
ulation frequencies 6. Each iteration of the Gibbs sampler 
involves multiple subsampling procedures: sampling clus- 
ter assignments {z/}, sampling stick lengths and 1/^^^., 
sampling stick-breaking hyperparameters q^o, y and A, and 
sampling the SNV population frequencies {0/}. Our main 
algorithmic contribution, described below, is a method to 
sample SNV population frequencies in such a way that 
the tumor evolution proceeds according to the assump- 
tions from the clonal evolutionary theory. The rest of the 
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subsampling procedures follow directly from Adams et al. 
[19] and we refer the reader to it for further technical 
details. 

Sampling SNV population frequencies 

Given the current state of the tree structure, we sam- 
ple SNV population frequencies in such a way that the 
SNV population frequency 0v of every non-leaf node v in 
the tree is greater than or equal to the sum of the SNV 
population frequencies of its children. To enforce this 
constraint, we introduce a set of auxiliary weights {rjy}, 
one for each node, that satisfy rjy = 1, and rewrite 
the observation model for allelic counts 5 explicitly in 
terms of these weights resulting in the following posterior 
distribution: 

p(rji I Ui, du l^i^ Ttu Si) oc ^p(ai \ du Gi = g, rju ti]) 

g 

xp{Gi I 7ri)p(7ri I Si)p(rji), 

(7) 

where we have used {rji} to denote the auxiliary weights 
for each SNV. The prior/base distribution of the auxil- 
iary weights is defined such that it is 1 for the singleton 
root node and Uniform (0, ?7par(v)) for any other node v 
in the tree, where par(v) denotes the parent node of v. 
When a new node w is added to the tree, we sample rj^, ~ 
Uniform (0,^par(M/)) and update r]^^r(w) ^ ^par(w) 
This ensures that rjy = 1. 

This change is crucial as it allows us to design a Markov 
chain that converges to the stationary distribution of {^v}- 
The SNV population frequency for any node v can then be 
computed via 

0v = + ^ ^14/ = + ^ (t>w> (8) 

weT>{v) weC{v) 

where V(v) and C(v) are the sets of all descendants and 
children of node v respectively. This construction ensures 
that the SNV population frequencies of mutations appear- 
ing at the parent node is greater than or equal to the sum 
of the frequencies of all its children. The procedure to gen- 
erate a random sample of SNV population frequencies is 
given in Algorithm 1 where we generate (rjy, (py) for every 
node V by traversing the tree in a breadth-first fashion. 
The input to this algorithm is the current state of the tree 
T = (V,E) where V is the set of vertices and E is the 
set of edges, and the output is a multi-dimensional sam- 
ple of SNV population frequencies 0 = {(pi>(t>2> - - - >(l>\v\} 
(where \ V\ = K) and the corresponding auxiliary weights 
rj = {rji, ri2, . . . , ri\Y\} . A sample from this algorithm is 
shown in Figure 11. 



Algorithm 1 Algorithm to generate SNV population 
frequencies satisfying the assumptions from clonal 
evolutionary theory. 

Input: Rooted tree T = (V,E) with root node r 
Output: r] = {rii,ri2,...,ri\Y\},^ = {0i, 02, • • • , 

1: create a queue Q 

2: Q.enqueue(r) 

3: while Q is not empty do 

4: V = Q.dequeueO 

5: if V is root then 

6: 0v = 1 

7: Sy ~ Uniform(0, 1) 

8: rjy = (j)y ' Sy 

9: end if 

10: my = (py — rjy {mass assigned to children of v} 

11: for c in children of v do 

12: Tc ~ Uniform (0, 1) {distribute mass} 

13: end for 

14: r = niy ' r/J2c 

15: for c in children of v do 

16: 0c = rc 

17: Sc ~ Uniform(0, 1) 

18: rjc = (t)c' Sc 

19: Q.enqueue(c) 

20: end for 

21: end while 

22: return rj, 0 



We use Metropolis-Hastings algorithm to sample from 
the posterior distribution of the auxiliary weights {fji} 7 
as shown in Algorithm 2 and derive the SNV population 
frequencies from these samples. We use an asymmetric 
Dirichlet distribution as the proposal distribution. This 
ensures that the Markov chain converges to the stationary 
distribution of {fji}. The inputs to the sampling algorithm 
are the current state of the tree T = (V, £), a scaling factor 




Figure 1 1 Example of SNV population frequencies generated 
using Algorithm I.The labels of the nodes are its corresponding 
SNV population frequencies and auxiliary weights (0 | j]). Note that 
J2vev = 1 and (pv > JlweCiv) (t>w for every non-leaf node v. 



Jiao etol. BMC Bioinformatics 2014, 15:35 
http://www.bionnedcentral.conn/l 471 -21 05/1 5/35 



(7, and the number of iterations T, The output is a sample 
from the posterior distribution of ?y = {^i, • • • > 
and its corresponding 0 = {0i, • • • > ^\v\}' 



Algorithm 2 Metropolis-Hastings algorithm to sample 
from the posterior distribution of the auxiliary weights 
{rjy} and compute the SNV population frequencies {0v}* 
Input: Rooted tree T = {V>E), a, T 
Output: rj = {^i,^2>...>^|y|}>0 = {0i>02>...>0|y|} 

1: InitiaUze rj^^^ using Algorithm 1 
2: for /: = 1 : r do 

3: //draw a proposal state from the Dirichlet distribu- 
tion with density function Q(-) 
4: ~ Dirichlet(or?y^^"^^ + 1) 
5: // accept/reject state 



7 


r ~ Uniform(0, 1) 


8 


if log(r) < a then 


9 


yjit) ^ 


10 


else 


11 




12 


end if 


13 


end for 


14 


Compute 0 from rj 8 


15 


return jy^^\0^^^ 



Extension to multiple tumor samples 

PhyloSub (cf. Figure 10) can be easily extended for mul- 
tiple tumor samples. We allow the tree-structured stick- 
breaking process prior 4 to be shared across all the 
samples. Let a^- and b^- denote the number of reads match- 
ing the reference and the variant allele respectively at 
position / for sample t e {1, . . . , 5'}, and let d^- = a\ -\- b\. 
Let 4>\ denote the fraction of cells from the variant popu- 
lation, i.e., the SNV population frequency at position / for 
sample t, and r}\ denote its corresponding auxiliary weight. 
The graphical model of PhyloSub for multiple tumor sam- 
ples is shown in Figure 12. The main technical difference 
between the single and the multiple sample models lies in 
the sampling procedure for SNV population frequencies. 
In the multiple sample model, we ensure that the clonal 
evolutionary constraints described in the previous section 
are satisfied separately for each tumor sample and then 
make a global Metropolis-Hastings move based on the 
distribution 0^=1 /^(^^ I *)> where {jy^, /y^, . . . , n]^} is the set 
of auxiliary weights for all the tumor samples. 

Partial order plot 

We construct a partial order plot to summarize and visu- 
alize the trees from all the posterior MCMC samples. It is 
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Figure 12 PhyloSub graphical model for multiple samples. 

Probabilistic grapliical model for allelic counts from multiple samples 
with a shared tree-structured stick-breaking process prior. Observed 
variables and hyperparameters (inputs to the model) are indicated in 
shaded nodes. 



important to note that the nodes of this partial order plot 
are the SNVs and not the SNV clusters. The thickness of 
a directed edge P ^ Q in the tree is proportional to the 
fraction of MCMC samples in which SNV P first appears 
in a subclonal lineage that was the parent of the subclonal 
lineage that Q first appears in. The color of the border 
of the SNVs represents the subclonal lineage cluster that 
the SNV is placed into post hoc using an algorithm called 
correlation clustering [32]. Note that the main purpose 
of this clustering algorithm is only to color the nodes in 
the partial order plot by aggregating the clustering infor- 
mation from all the MCMC samples obtained from our 
model; this clustering is a summary but does not repre- 
sent any (possibly quite large) uncertainty in the cluster 
assignments. The input to this algorithm is a symmetric 
N X N co-clustering matrix C, whose elements Qy is the 
difference between the number of samples in which / and 
7 were assigned to the same SNV cluster and the num- 
ber of samples in which / and ; were assigned to different 
SNV clusters. The algorithm estimates a symmetric NxN 
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cluster indicator matrix Y, whose elements Y/y = 1 if / and 
7 are assigned to the same cluster and Yij = 0 otherwise. 
This cluster indicator matrix Y has all the information 
about the number of SNV clusters as well as the SNVs 
assigned to each of them. 

MCMC settings 

In all the experiments, we fix the number of MCMC iter- 
ations to 5,000 with a burn-in of 100 samples. We also fix 
the number of iterations in the Metropolis-Hastings algo- 
rithm to 5,000 and set the scaling factor for the Dirichlet 
proposal distribution to a = 100. We run the MCMC 
samplers multiple times with different random initializa- 
tions and pick a single run based on the complete-data 
likelihood trace and its auto-correlation function. We use 
all the 5,000 samples without thinning [33] to construct 
the partial order plots. We use the CODA R package [34] 
for MCMC diagnostics to monitor the convergence of 
the samplers. The complete-data log likelihood traces and 
the corresponding autocorrelation function plots after the 
burn-in period of 100 samples for all the experiments on 
AML and CLL datasets are shown in (Additional file 2: 
Figures S3 to S7). 

Datasets and inputs to PhyloSub 

All datasets used in the experiments, including details 
about the inputs to PhlyoSub, i.e., the set of observations 
{{audu ^y-y^i)}f^^y zxQ provlded in the (Additional 
file 1: Tables S5 - SIO). 

Additional files 



Additional file 1 : Supplementary tables. This file contains 
supplementary Tables SI to SIO. 

Additional file 2: Supplementary figures. This file contains 
supplementary Figures SI to S7. 
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